Add TMA TensorMapDescriptor support #1687

Open

rparolin wants to merge 19 commits into NVIDIA:main from rparolin:rparolin/tma_feature

Conversation

@rparolin (Collaborator) commented Feb 24, 2026

Summary

  • Add TensorMapDescriptor Cython class wrapping the CUDA driver's CUtensorMap for Hopper+ TMA (Tensor Memory Accelerator) bulk data movement
  • Support tiled and im2col descriptor creation via from_tiled() and from_im2col() class methods, with automatic dtype inference, stride computation, and validation
  • Integrate TensorMapDescriptor as a first-class kernel argument in _kernel_arg_handler.pyx
  • Add comprehensive tests (test_tensor_map.py) and an example (tma_tensor_map.py)

Closes #199
Closes #200

@copy-pr-bot (Contributor) commented Feb 24, 2026

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (2)

cuda_core/pixi.toml:67

  • Removing the cu12 environment from this subproject can break the repository’s top-level pixi run -e cu12 test workflow, which runs pixi run --manifest-path cuda_core test under the propagated PIXI_ENVIRONMENT_NAME=cu12. If cu12 testing is still expected at the workspace level, consider keeping a solvable cu12 environment here (e.g., using conda-forge cuda-bindings/cuda-version constraints instead of the path dependency) or updating the workspace test tasks to avoid selecting a missing environment.

# NOTE: cu12 environment is intentionally omitted because the path dependency
# to ../cuda_bindings (v13.1) makes it unsolvable locally. For cu12 testing,
# use conda-forge packages or CI workflows.
[environments]
default = { features = [
    "cu13",
    "test",
    "cython-tests",
], solve-group = "default" }
cu13 = { features = ["cu13", "test", "cython-tests"], solve-group = "default" }

cuda_core/cuda/core/_tensor_map.pyx:461

  • c_pixel_box_lower / c_pixel_box_upper are declared as fixed-size int[3] but only the first n_spatial entries are written. If the driver implementation reads all 3 entries (the API supports up to 3 spatial dims), the remaining uninitialized values can make encoding nondeterministic. Initialize the full arrays (e.g., set all 3 to 0 first) before filling the active elements.
        cdef uint64_t[5] c_global_dim
        cdef uint64_t[4] c_global_strides
        cdef uint32_t[5] c_element_strides
        cdef int[3] c_pixel_box_lower  # max 3 spatial dims (rank 5 - 2)
        cdef int[3] c_pixel_box_upper
        cdef int i_c

        for i_c in range(rank):
            c_global_dim[i_c] = <uint64_t>shape[rank - 1 - i_c]
            c_element_strides[i_c] = <uint32_t>element_strides[rank - 1 - i_c]

        for i_c in range(rank - 1):
            c_global_strides[i_c] = <uint64_t>byte_strides[rank - 2 - i_c]

        # Reverse spatial dimensions for lower/upper corners
        for i_c in range(n_spatial):
            c_pixel_box_lower[i_c] = <int>pixel_box_lower_corner[n_spatial - 1 - i_c]
            c_pixel_box_upper[i_c] = <int>pixel_box_upper_corner[n_spatial - 1 - i_c]
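The fix the review suggests, zero-filling the fixed-size arrays before writing the active entries, can be sketched in plain Python (an illustrative sketch, not the Cython change itself; the function name is made up):

```python
def pack_pixel_box(lower, upper, max_spatial=3):
    """Reverse spatial dims into fixed-size arrays, zero-filling unused slots.

    Zero-filling first makes the encoded bytes deterministic even if the
    driver reads all max_spatial entries rather than only n_spatial of them.
    """
    n_spatial = len(lower)
    assert len(upper) == n_spatial <= max_spatial
    c_lower = [0] * max_spatial
    c_upper = [0] * max_spatial
    for i in range(n_spatial):
        # mirror the Cython loop: dimensions are stored fastest-varying first
        c_lower[i] = lower[n_spatial - 1 - i]
        c_upper[i] = upper[n_spatial - 1 - i]
    return c_lower, c_upper
```

In the Cython code the same effect can be had by initializing all three slots of `c_pixel_box_lower` / `c_pixel_box_upper` to zero before the `range(n_spatial)` loop.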



Comment on lines +270 to +272
view = _get_validated_view(tensor)
desc._source_ref = tensor

Copilot AI commented Feb 25, 2026

TensorMapDescriptor stores _source_ref = tensor, but when tensor is a DLPack producer the pointer/metadata lifetime is governed by the DLPack capsule returned by __dlpack__(). Since the temporary StridedMemoryView (which holds the capsule and calls the deleter in __dealloc__) is not retained, the capsule can be released immediately, potentially invalidating globalAddress for exporters where the capsule owns the backing allocation. Store a strong reference to the StridedMemoryView (or at least its metadata capsule) instead of (or in addition to) the original tensor object.

@rparolin (Collaborator, Author) commented:
/ok to test


@leofang (Member) left a comment

There is a coordinated effort between C++ and Python: #199 (comment). Can we please look into reusing the C++ implementation (mainly because @fbusato is a TMA expert) and avoid re-implementing it if possible?

@fbusato commented Feb 27, 2026

Fighting with poor documentation and bugs doesn't make me an expert :).
Anyway, we provide two main functionalities in this direction:

The implementation of make_tma_descriptor is here https://github.com/NVIDIA/cccl/blob/main/libcudacxx/include/cuda/__tma/make_tma_descriptor.h. Please let me know if there are functionalities that need to be isolated for reuse.

@rparolin rparolin added the cuda.core Everything related to the cuda.core module label Mar 2, 2026
@rparolin rparolin requested a review from Copilot March 3, 2026 23:56
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (3)

cuda_core/tests/test_tensor_map.py:102

  • This test passes a raw Buffer from dev.allocate() with data_type=FLOAT32. Buffer exports via DLPack as an int8 tensor with shape=(n_bytes,), so the TMA encoder will treat shape[0] as a float32 element count unless the implementation compensates for this. That can create a descriptor that covers 4× more memory than the allocation and hide potential out-of-bounds issues. Prefer wrapping the buffer in _DeviceArray(buf, (1024,), dtype=np.float32) (or StridedMemoryView.from_buffer with the intended shape/dtype) so the descriptor is built from element-count dimensions matching the data type.
        buf = dev.allocate(1024 * 4)  # 1024 float32 elements
        desc = TensorMapDescriptor.from_tiled(
            buf,
            box_dim=(64,),
            data_type=TensorMapDataType.FLOAT32,
        )
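The unit mismatch flagged here can be stated as a small helper (a sketch, not part of the PR; FLOAT32 has an itemsize of 4 bytes):

```python
def global_dim_elements(n_bytes, itemsize):
    """Convert a raw byte length to the element count a TMA global_dim expects.

    TMA global dimensions are expressed in elements of the descriptor's data
    type, so treating a byte length as an element count for FLOAT32 would
    describe 4x more memory than was actually allocated.
    """
    if n_bytes % itemsize:
        raise ValueError("byte length is not a multiple of the element size")
    return n_bytes // itemsize

# 4096 bytes of float32 storage is 1024 elements, not 4096.
```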

cuda_core/tests/test_tensor_map.py:277

  • Same issue as test_from_tiled_1d: building a descriptor from a raw Buffer with data_type=FLOAT32 relies on the implementation translating the buffer's byte-length into a float32 element count. To avoid encoding a descriptor with incorrect global_dim, wrap buf1/buf2 in _DeviceArray(..., dtype=np.float32) (or a StridedMemoryView with the intended dtype/shape) before calling from_tiled() / replace_address().
    def test_replace_address(self, dev, skip_if_no_tma):
        buf1 = dev.allocate(1024 * 4)
        desc = TensorMapDescriptor.from_tiled(
            buf1,
            box_dim=(64,),
            data_type=TensorMapDataType.FLOAT32,
        )

cuda_core/cuda/core/_kernel_arg_handler.pyx:305

  • Support for passing TensorMapDescriptor as a kernel argument is added here, but there’s no test exercising the full path (ParamHolder → cuLaunchKernel) with a real TensorMapDescriptor argument. Given cuda_core/tests/test_launcher.py already validates scalar/buffer argument handling, consider adding a small integration test that launches a kernel taking a CUtensorMap by value and verifies it can be consumed (or at least that the kernel receives the expected 128-byte payload). This will protect against ABI/size/alignment regressions in the argument marshalling logic.
            elif arg_type is tensor_map_descriptor_type:
                prepare_tensor_map_arg(self.data, self.data_addresses, <TensorMapDescriptor>arg, i)
                continue
            elif arg_type is bool:
                prepare_arg[cpp_bool](self.data, self.data_addresses, arg, i)


rparolin and others added 9 commits March 6, 2026 21:26
…time

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove unused _alloc_device_tensor helper from tests
- Add test for rank > 5 (6D tensor) to verify upper bound validation
- Add NULL check for PyMem_Malloc in prepare_tensor_map_arg

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the replace_address() demonstration into its own self-contained
example (tma_replace_address.py) so each file covers a single concept.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…alidated views alive to avoid DLPack-backed pointer lifetime hazards.

Add explicit tiled element-stride coverage and acknowledge the DLPack include-layout compatibility follow-up in NVIDIA/cccl#7871.

Made-with: Cursor
@cpcloud cpcloud force-pushed the rparolin/tma_feature branch from f4875f6 to 96a3e84 on March 7, 2026 02:28
@cpcloud (Contributor) commented Mar 7, 2026

/ok to test

Probe support in the fixture and skip when cuda.core is built without CUDA 13 im2col-wide support or when the driver/GPU reports CUDA_ERROR_INVALID_VALUE, so unsupported RTXPRO6000 lanes don't block unrelated changes.

Made-with: Cursor
@cpcloud (Contributor) commented Mar 7, 2026

/ok to test

…safety.

Expose only TensorMapDescriptor in cuda.core, add StridedMemoryView.as_tensor_map(), remove redundant tensor-map fallback packing, and track/check descriptor context/device compatibility before replacement and kernel launch argument packing.

Made-with: Cursor
@cpcloud (Contributor) commented Mar 7, 2026

/ok to test

Bring back the cu12 feature blocks so pixi can parse the manifest and local test commands no longer fail early with a missing feature error.

Made-with: Cursor
@cpcloud (Contributor) commented Mar 7, 2026

/ok to test

@cpcloud cpcloud requested a review from leofang March 7, 2026 16:54
Reject CUDA device-local tensors from a different GPU while still allowing CUDA host and managed memory.

Add regression tests for descriptor creation, replace_address, and the shared validation helper.
@rparolin (Collaborator, Author) commented:
/ok to test

@rparolin rparolin requested a review from cpcloud March 11, 2026 16:29
@cpcloud (Contributor) commented Mar 12, 2026

@leofang Please review. This is now blocked on your review.

@rparolin rparolin enabled auto-merge (squash) March 12, 2026 22:25
@leofang leofang disabled auto-merge March 13, 2026 20:22
Comment on lines +19 to +24
# if __has_include(<dlpack/dlpack.h>)
# include <dlpack/dlpack.h>
# define CUDA_CORE_HAS_DLPACK_H 1
# else
# define CUDA_CORE_HAS_DLPACK_H 0
# endif
@leofang (Member) commented Mar 13, 2026

Q: we guarantee to have a dlpack.h during build time, but it's not accessible via <dlpack/dlpack.h>, so does it mean we end up with CUDA_CORE_HAS_DLPACK_H == 0?

@rparolin (Collaborator, Author) replied:

Yes. Always defining the macro to either 0 or 1 is generally good practice in C++ code, since you can then test its value directly instead of first checking whether the macro is defined:

#if CUDA_CORE_HAS_DLPACK_H == 0
  // fallback path when dlpack.h is unavailable
#endif


#if defined(__has_include)
# if __has_include(<cuda/tma>)
# include <cuda/tma>
Member comment:

Q: I am confused -- The TMA header was added fairly recently. We build cuda.core against both CUDA 12 & 13 and merge the resulting wheels. Without vendoring the CCCL header, how did we manage to build and make CI green? 🤔

Member reply:

It's only available via the CTK as of CUDA 13.2, which means that before this week it was not there.
https://nvidia.github.io/cccl/unstable/libcudacxx/extended_api/tma/make_tma_descriptor.html

@rparolin (Collaborator, Author) replied:

The dependent code is only compiled in when the header is available: CUDA_CORE_HAS_CUDA_TMA.

        vector.vector[void*]& data_addresses,
        TensorMapDescriptor arg,
        const size_t idx) except -1:
    arg._check_context_compat()
Member comment:

I am torn on this. I understand the rationale, but this call will be very slow since it involves multiple driver API calls.

We should check this when a TMA is constructed in Python, memoize the device/context attributes, and then move on. For example, we don't do pointer attribute check at launch time either. It adds just too much overhead.

Comment on lines +139 to +145
    # Allocate a temporary buffer for the 128-byte CUtensorMap struct.
    # We copy rather than pointing directly at arg._tensor_map for lifetime
    # safety: ParamHolder owns and frees its argument buffers independently.
    cdef void* ptr = PyMem_Malloc(sizeof(cydriver.CUtensorMap))
    if ptr is NULL:
        raise MemoryError("Failed to allocate memory for CUtensorMap")
    memcpy(ptr, arg._get_data_ptr(), sizeof(cydriver.CUtensorMap))
Member comment:
This is unnecessary because the driver will copy again. Just pack the pointer and pass it to cuLaunchKernel, and the driver will copy it over.
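The mechanics can be illustrated with ctypes (a sketch: cuLaunchKernel takes a `void** kernelParams` array of pointers to each argument's storage, and the driver copies the pointed-to bytes itself during launch):

```python
import ctypes

# Stand-in for the descriptor's existing 128-byte CUtensorMap storage.
tensor_map = (ctypes.c_byte * 128)()

# Pack the address of that storage directly -- no heap copy needed, as long
# as the descriptor outlives the launch call (which ParamHolder can
# guarantee by simply holding a reference to the descriptor object).
kernel_params = (ctypes.c_void_p * 1)(ctypes.addressof(tensor_map))
```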

Comment on lines +87 to +107
    // Initialise a single-phase mbarrier (1 arriving thread).
    asm volatile(
        "mbarrier.init.shared.b64 [%0], 1;"
        :: "r"((unsigned)__cvta_generic_to_shared(&mbar)));

    // Ask TMA to copy TILE_SIZE floats starting at element 'tile_start'
    // from the tensor described by 'tensor_map' into shared memory.
    asm volatile(
        "cp.async.bulk.tensor.1d.shared::cluster.global.tile"
        ".mbarrier::complete_tx::bytes"
        " [%0], [%1, {%2}], [%3];"
        :: "r"((unsigned)__cvta_generic_to_shared(smem)),
           "l"(&tensor_map),
           "r"(tile_start),
           "r"((unsigned)__cvta_generic_to_shared(&mbar)));

    // Tell the mbarrier how many bytes the TMA will deliver.
    asm volatile(
        "mbarrier.arrive.expect_tx.shared.b64 _, [%0], %1;"
        :: "r"((unsigned)__cvta_generic_to_shared(&mbar)),
           "r"((unsigned)(TILE_SIZE * sizeof(float))));
Member comment:
ditto

Comment on lines +115 to +121
    asm volatile(
        "{ .reg .pred P; \n"
        "WAIT: \n"
        "  mbarrier.try_wait.parity.shared.b64 P, [%0], 0; \n"
        "  @!P bra WAIT; \n"
        "} \n"
        :: "r"((unsigned)__cvta_generic_to_shared(&mbar)));
Member comment:
ditto

Member comment:
I now think this example should just be combined with tma_tensor_map.py, since we have lots of code repetition here.

rparolin and others added 3 commits March 13, 2026 14:20
Co-authored-by: Leo Fang <leo80042@gmail.com>
Co-authored-by: Leo Fang <leo80042@gmail.com>
Co-authored-by: Leo Fang <leo80042@gmail.com>
Development

Successfully merging this pull request may close these issues:

  • Design the TensorMap object
  • EPIC: Support TMA descriptor